Specialization of: Business Agility
Site Reliability Engineering (SRE)
SRE blends software engineering with operations to achieve reliability at scale.
WHAT YOU’LL LEARN
- Translate customer expectations into SLOs and error budgets.
- Run on-call, triage incidents, and conduct blameless postmortems.
- Instrument services for logs, metrics, traces, and APM.
- Use chaos experiments and capacity planning to prevent failures.
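As a rough illustration of the first objective, an error budget is the amount of unreliability an SLO permits over a window. The sketch below is one way to compute it for a request-based SLO. The class name, field names, and sample numbers are hypothetical and chosen only for illustration; they do not come from any course material or specific tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper for a request-based SLO (illustrative only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that failed the success criterion

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO allows to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Fraction of the budget still unspent; negative means it is overspent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0  # a 100% target (or no traffic) leaves no budget
        return 1.0 - self.failed_requests / allowed


# Made-up sample numbers for a single SLO window.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

With these sample numbers, a 99.9% target over one million requests permits about 1,000 failed requests, so 250 failures use roughly a quarter of the budget. A time-based SLO (minutes of downtime per window) follows the same arithmetic with minutes in place of requests.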
CORE TOPICS (Clusters)
- SLOs & Error Budgets
- Incident Response & On-Call
- Observability & APM
- Chaos Engineering
- Capacity & Scaling
OUTCOMES
- Measurable reliability aligned to user experience.
- Lower MTTR and fewer severe incidents.
- Proactive resilience through testing and capacity models.
Prerequisites: Basic production experience and monitoring fundamentals.